Improve and re-release chapter 2 #911
base: main
Conversation
}
]}
/>
# Optimized Inference Deployment
There is an issue in the build process. I think it comes from the framework tags in this file; at the very least, <FrameworkSwitchCourse {fw} /> seems to be missing.
The model can be used in this state, but it will output gibberish; it needs to be trained first. We could train the model from scratch on the task at hand, but as you saw in [Chapter 1](/course/chapter1), this would require a long time and a lot of data, and it would have a non-negligible environmental impact. To avoid unnecessary and duplicated effort, it's imperative to be able to share and reuse models that have already been trained.
You'll notice that the tokenizer has added special tokens — `[CLS]` and `[SEP]` — required by the model. Not all models need special tokens; they're utilized when a model was pretrained with them, in which case the tokenizer needs to add them as that model expects these tokens.
This sentence feels convoluted
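For illustration, a minimal sketch of the behaviour that paragraph describes, using the standard `transformers` tokenizer API (the checkpoint and example sentence are arbitrary, and the exact token split shown is illustrative):

```python
from transformers import AutoTokenizer

# Load a tokenizer whose model was pretrained with [CLS]/[SEP] special tokens
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded = tokenizer("Using a Transformer network is simple")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Illustrative output:
# ['[CLS]', 'Using', 'a', 'Trans', '##former', 'network', 'is', 'simple', '[SEP]']
```

The point is simply that `[CLS]` and `[SEP]` are injected by the tokenizer because this particular checkpoint was pretrained with them.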
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tgi/flash-attn.png" alt="Flash Attention" />

<Tip title="How Flash Attention Works">
Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [section 12.3](2.mdx), the attention mechanism has quadratic complexity and memory usage, making it inefficient for long sequences.
Suggested change:
Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [section 12.3](2.mdx), the attention mechanism has quadratic complexity and memory usage, making it inefficient for long sequences.
Flash Attention is a technique that optimizes the attention mechanism in transformer models by addressing memory bandwidth bottlenecks. As discussed earlier in [Chapter 1.8](/course/chapter1/8), the attention mechanism has quadratic complexity and memory usage, making it inefficient for long sequences.
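To make the quadratic-memory point concrete, a hedged sketch assuming PyTorch 2.x, where `scaled_dot_product_attention` can dispatch to a fused, Flash-Attention-style kernel on supported hardware instead of materializing the full score matrix:

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Naive attention materializes a seq_len x seq_len score matrix per head,
# so memory grows quadratically with sequence length.
scores = (q @ k.transpose(-2, -1)) / (head_dim ** 0.5)
out_naive = torch.softmax(scores, dim=-1) @ v

# The fused implementation computes the same result in tiles without storing
# the full score matrix; a Flash-Attention kernel is one backend it may pick.
out_fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_naive, out_fused, atol=1e-4))
```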
**vLLM** takes a different approach by using PagedAttention. Just like how a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This clever system means it can handle different-sized requests more flexibly and doesn't waste memory space. It's particularly good at sharing memory between different requests and reduces memory fragmentation, which makes the whole system more efficient.

<Tip title="How Paged Attention Works">
Maintaining consistency in naming:
Suggested change:
<Tip title="How Paged Attention Works">
<Tip title="How PagedAttention Works">
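A toy sketch of the paging idea (an assumption-laden illustration, not vLLM's actual API): the KV cache is carved into fixed-size blocks and each request keeps a small block table, so memory grows in block-sized increments and freed blocks return to a shared pool.

```python
# Toy illustration of paged KV-cache management (not vLLM's real implementation).
BLOCK_SIZE = 16  # tokens per block, analogous to a page size

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # pool shared by all requests
        self.block_tables = {}                       # request_id -> list of block ids

    def append_token(self, request_id: str, position: int) -> int:
        """Return the block that should hold the KV entry for this token."""
        table = self.block_tables.setdefault(request_id, [])
        if position % BLOCK_SIZE == 0:               # current block full (or first token)
            table.append(self.free_blocks.pop())     # allocate one small block, not a big slab
        return table[position // BLOCK_SIZE]

    def release(self, request_id: str):
        """When a request finishes, its blocks go straight back to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                                # a 40-token request only needs 3 blocks
    cache.append_token("req-1", pos)
print(len(cache.block_tables["req-1"]))              # 3
cache.release("req-1")
```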
Super well written and informative, as always @burtenshaw. Great improvement over the previous iteration! The main issue is the failing build process; the rest are just nits 😄
Thanks @sergiopaniego. Working on the framework options now.
Co-authored-by: Sergio Paniego Blanco <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This is a minor improvement to chapter 2 to do these things: